Abstract



We consider the problem of finding the graph on which an epidemic cascade spreads, given only the 
times when each node gets infected. While this is a problem of importance in several contexts - offline 
and online social networks, e-commerce, epidemiology, vulnerabilities in infrastructure networks - there 
has been very little work, analytical or empirical, on finding the graph. Clearly, it is impossible to do 
so from just one cascade; our interest is in learning the graph from a small number of cascades. 

For the classic and popular "independent cascade" SIR epidemics, we analytically establish the number
of cascades required by both the global maximum-likelihood (ML) estimator, and a natural greedy
algorithm. Both results are based on a key observation: the global graph learning problem decouples
into n local problems - one for each node. For a node of degree d, we show that its neighborhood can be
reliably found once it has been infected $O(d^2 \log n)$ times (for ML on general graphs) or $O(d \log n)$ times
(for greedy on trees). We also provide a corresponding information-theoretic lower bound of $\Omega(d \log n)$;
thus our bounds are essentially tight. Furthermore, if we are given side-information in the form of a 
super-graph of the actual graph (as is often the case), then the number of cascade samples required - 
in all cases - becomes independent of the network size n. 

Finally, we show that for a very general SIR epidemic cascade model, the Markov graph of infection 
times is obtained via the moralization of the network graph. 

Keywords: Epidemics, cascades, network inverse problems, structure learning, sample complexity, 
Markov random fields 



1 Introduction 

Cascading, or epidemic, processes are those where the actions, infections or failure of certain nodes increase 
the susceptibility of other nodes to the same; this results in the successive spread of infections / failures 
/ other phenomena from a small set of initial nodes to a much larger set. Initially developed as a way
to study human disease propagation, cascade or epidemic processes have recently emerged as popular and 
useful models in a wide range of application areas. Examples include 

(a) social networks: cascading processes provide natural models for understanding both the consumption 
of online media (e.g. viral videos, news articles [13]) and spread of ideas and opinions (e.g. trending of
topics and hashtags on Twitter/Facebook [24], keywords on blog networks [1])

(b) e-commerce: understanding epidemic cascades (and, in this case, finding influential nodes) is crucial to 
viral marketing [9], and predicting/optimizing uptake on social buying sites like Groupon etc.



(c) security and reliability: epidemic cascades model both the spread of computer worms and malware [10],
and cascading failures in infrastructure networks [^3] and complex organizations [18].

(d) peer-to-peer networks: epidemic protocols, where users send and receive (pieces of) files in a
random uncoordinated fashion, form the basis for many popular peer-to-peer content distribution, caching
and streaming networks [14], [3].

Structure Learning: The vast majority of work on cascading processes has focused on understanding 
how the graph structure of the network (e.g. power laws, small world, expansion etc.) affects the spread of 
cascades. We focus on the inverse problem: if we only observe the states of nodes as the cascades spread, 
can we infer the underlying graph? Structure learning is the crucial first step before we can use network
structure; for example, before we find influential nodes in a network (e.g. for viral marketing) we need to 
know the graph. Often however we may only have crude, prior information about what the graph is, or 
indeed no information at all. 

For example, in online social networks like Twitter or Facebook, we may have access to a nominal graph 
of all the friends of a user. However, clearly not all of them have an equal effect on the user's behavior; 
we would like to find the sub-graph of important links. In several other settings, we may have no a-priori 
information; examples include information forensics that study the spread of worms, and offline settings 
like real-world epidemiology and social science. The standard practice seems to be to use crude/nominal
subgraphs if they exist (e.g. Twitter), or find graphs by other means (e.g. surveys). We propose to take a 
data-driven approach, finding graphs from observations of the cascades themselves. 

While structure learning from cascades is an important primitive, there has been very little work 
investigating it (we summarize below). There are two related issues that need to be addressed: (a) 
algorithms: what is the method, and its complexity, and (b) performance: how many observations are 
needed for reliable graph recovery? The main intellectual contribution of this paper is characterizing the 
performance of two algorithms we develop, and a lower bound showing they perform close to optimal. 
To the best of our knowledge, there exists no prior work on performance analysis (i.e. characterizing the 
number of observations needed) for learning graphs of epidemic cascades. 

1.1 Summary of Our Results 

We present two algorithms, and information-theoretic lower bounds, for the problem of learning the graph 
of an epidemic cascade when we are given prior information of a super-graph.^1 It is not possible to learn
the graph from a single cascade; we study the number of cascades required for reliable learning. Key 
outcomes of our results are that (i) epidemic graph learning can be done in a fast, distributed fashion, (ii) 
with a number of samples that is close to the lower bound. Our results: 

(a) Maximum Likelihood: We show that, via a suitable change of variables, the problem of finding 
the graph most likely to generate the cascades we observe decouples into n convex problems - one for each 
node. Further, for node i, the algorithm requires as input only the infection times of that node's size-$D_i$
super-neighborhood; it is local both in computation and in the information requirement. Our main result
here is to establish that for this efficient algorithm, if $d_i$ is the size of the true neighborhood, then node i
needs to be infected $O(d_i^2 \log D_i)$ times before we learn it, for a general graph.

(b) Greedy algorithm: We show that if the graph is a tree, then a natural greedy algorithm is 
able to find the true neighborhood of a node i with only $O(d_i \log D_i)$ samples. The greedy algorithm
involves iteratively adding to the neighborhood the node which "explains" (i.e. could be the likely cause
of) the largest number of instances when node i was infected, and removing those infections from further
consideration.

^1 Of course if no super-graph is given, it can be taken to be the complete graph.

(c) Lower bounds: We first establish a general information-theoretic lower bound on the number of 
cascade samples required for approximate graph recovery, for general (but abstract) notions of approximation,
and for any SIR process. We then derive two corollaries: one for learning a graph up to a specified edit
distance when there is no super-graph information, and another for the case when there is a super-graph, 
and specified edit distances for each of the nodes. These bounds show that the ML algorithm is at most a 
factor d away from the optimal. 

(d) Markov structure of general cascades: Every set of random variables has an associated 
Markov graph. In our final result, we show that for a very general SIR epidemic cascade model - essentially 
any that is causal with respect to time and the directed network graph - the (undirected) Markov graph
of the (random) infection times is the moralized graph of the true directed network graph on which the 
epidemic spreads. This allows for learning graph structure using techniques from Markov Random Fields 
/ graphical models, and also illustrates the role of causality. 
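
As an illustration of the moralization operation mentioned in (d), the following is a minimal Python sketch. It assumes the directed network is given as a dictionary mapping each node to the set of its parents; the function name `moralize` and this input format are hypothetical choices made for the example.

```python
def moralize(parents):
    """Return the undirected moral graph as a set of frozenset edges.

    `parents[v]` is the set of parents of node v (hypothetical input format).
    """
    edges = set()
    for child, pars in parents.items():
        pars = sorted(p for p in pars if p != child)
        # Keep every parent-child edge, dropping its direction.
        for p in pars:
            edges.add(frozenset((p, child)))
        # "Marry" every pair of parents that share this child.
        for a in range(len(pars)):
            for b in range(a + 1, len(pars)):
                edges.add(frozenset((pars[a], pars[b])))
    return edges


# Example: a -> c <- b. Moralization adds the undirected edge {a, b}.
moral = moralize({"a": set(), "b": set(), "c": {"a", "b"}})
```

In this small example the two parents of c become neighbors in the moral graph even though no directed edge joins them.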

While here we used the $O(\cdot)$ and $\Omega(\cdot)$ notation for compact statement, we emphasize that our results
are non-asymptotic, and thus more general than a merely asymptotic result. Thus for fixed values of
system parameters and probabilities of error, we give precise bounds on the number of cascades we need to 
observe. If one is interested in asymptotic results under particular scaling regimes for the parameters, such 
results can be derived as corollaries of our algorithms (with union bounds if one is interested in complete 
graph recovery). 

A nice feature of our results is that both the algorithms work on a node by node basis. Thus for 
recovering the neighbors of a node we only need information about its super-neighborhood, and solve 
a local problem. We are also able to find the neighborhood of one or a few nodes, without worrying 
about finding the neighborhoods of other nodes or the entire graph. Similarly, the number of samples
required to recover the neighborhood of a node depends only on the sizes of its own neighborhood and
super-neighborhood.

1.2 Related Work 

Learning graphs of epidemic cascades: While structure learning from cascades is an important prim- 
itive, there has been very little work investigating it: 

(a) algorithms: A recent paper investigates learning graphs from infection times for the independent 
cascade model (similar setting as our paper). However, they take an approach that results in an NP-hard 
combinatorial optimization problem, which they show can be approximated. Another paper [16] shows
max-likelihood estimation in the independent cascade model can be cast as a decoupled convex optimiza- 
tion problem (albeit a different one from ours). 

(b) performance: To the best of our knowledge, there has been no work on the crucial question of how 
many cascades one needs to observe to learn the graph; indeed, this question is the main focus of our 
paper. 

Markov graph structure learning: The ideas in this paper are related to those from Markov Random 
Fields (MRFs, aka Graphical Models) in statistics and machine learning, but there are also important 
differences. We overview the related work, and contrast it to ours, in Section 6.

2 System Model



Most of the analytical results of this paper are for the classic and popular independent cascade model 
of epidemics; in particular we will consider the simple one-step model first proposed in [6] and recently
popularized by Kempe, Kleinberg and Tardos [9].

Standard independent cascade epidemic model [9]: The network is assumed to be a directed
graph $G = (V, E)$; for every directed edge $(i, j)$ we say i is a parent and j is a child of the corresponding
other node. A parent may infect a child along an edge, but the reverse cannot happen; we allow bi-directed
edges (i.e. it is possible that both $(i, j)$ and $(j, i)$ are in E). Let $V_i := \{j : (j, i) \in E\}$ denote the set of parents
of each node i, and for convenience we also include $i \in V_i$. Epidemics proceed in discrete time; all nodes
are initially in the susceptible state. At time 0, each node tosses a coin and independently becomes active,
with probability $p_{init}$. The initially active nodes are called seeds. In every time step each active
node probabilistically tries to infect its susceptible children; if node i is active at time t, it will infect each
susceptible child j with probability $p_{ij}$, independently. Correspondingly, a node j that is susceptible at
time t will become active in the next time step, i.e. t + 1, if any one of its parents infects it. Finally, a node
remains active for only one time slot, after which it becomes inactive: it does not spread the infection, and
cannot be infected again. Thus this is an "SIR" epidemic, where some nodes remain forever susceptible
because the epidemic never reaches them, while others transition according to:

susceptible -> active for one time step -> inactive. A sample path of the independent cascade model
is illustrated in Figure 1.




Figure 1: Illustration of the independent cascade model: This figure illustrates a sample path of 
the evolution of the independent cascade model. The four figures above represent the state of the system 
at time steps 0, 1, 2 and 3 respectively. A node with no box around it means that it is in susceptible state, 
a node with a square around it means that it is active and a node with a star around it means that it is 
inactive. At time step 0, nodes b and c are chosen as seeds. They infect d and f respectively and turn
inactive. In time step 1, d infects a whereas f fails to infect any of its children. In time step 2, a does not
have any children to infect. Once a turns inactive in time step 3, the epidemic stops. 

Note that the parental set is thus $V_i = \{j : p_{ji} > 0\}$, i.e. the set of all nodes that have a non-zero
probability of infecting i.
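
To make the dynamics concrete, the following is a minimal Python sketch of a single cascade under the one-step model described above. The function name `simulate_cascade`, the matrix layout `p[j][i]` (probability that active node j infects child i) and the use of `math.inf` for nodes that are never infected are illustrative assumptions, not notation from the model itself.

```python
import math
import random


def simulate_cascade(p, p_init, rng):
    """Sample one cascade and return the list of infection times.

    p[j][i] is the probability that an active node j infects child i.
    Nodes that are never infected get time math.inf.
    """
    n = len(p)
    times = [math.inf] * n
    # Time 0: every node independently becomes a seed with probability p_init.
    active = [i for i in range(n) if rng.random() < p_init]
    for i in active:
        times[i] = 0
    t = 0
    while active:
        newly = []
        for i in range(n):
            if times[i] != math.inf:
                continue  # already infected earlier, cannot be infected again
            # i becomes active at t + 1 if any currently active parent succeeds.
            if any(p[j][i] > 0 and rng.random() < p[j][i] for j in active):
                newly.append(i)
        t += 1
        for i in newly:
            times[i] = t
        active = newly  # nodes stay active for exactly one time slot
    return times


# Hypothetical 3-node directed cycle 0 -> 1 -> 2 -> 0 with edge probability 0.5.
p_example = [[0.0, 0.5, 0.0], [0.0, 0.0, 0.5], [0.5, 0.0, 0.0]]
times = simulate_cascade(p_example, 0.3, random.Random(0))
```

Calling the function repeatedly with the same `p` produces the independent cascades that the estimators in later sections take as input.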

Observation model: For an epidemic cascade u that spreads over a graph, we observe for each node
i the time $t_i^u$ when i became active. If i is one of the seed nodes of cascade u then $t_i^u = 0$, and for nodes
that are never infected in u we set $t_i^u = \infty$. Let $t^u$ denote the vector of infection times for cascade u.
We observe more than one cascade on the same graph; let U be the set of cascades, and $m = |U|$ be the
number, which we will often refer to as the sample complexity. Each cascade is assumed to be generated
and observed as above, independent of all others.

(possible) Super-graph information: In several applications, we (may) also have prior knowledge
about the network, in the form of a directed super-graph^2 of G. We find it convenient to represent super-graph
information as follows: for each node i, we are given a set $S_i \subseteq V$ of nodes that contains its true
parents; i.e. $V_i \subseteq S_i$ for all i. In terms of edge probabilities, this means that $p_{ji} > 0$ (strictly) for $j \in V_i$,
and $p_{ji} = 0$ for $j \in S_i \setminus V_i$. Of course if no super-graph is available we can set $S_i = V$, the set of all nodes;
so from now on we assume an $S_i$ is always available.

Problem description: Using the vectors of infection times $\{t^u\}$ we are interested in finding the
parental neighborhood $V_i$, for some or all of the nodes i. That is, we want to find the set of nodes that can
infect i. This is not possible when we only observe a single cascade; we will thus be interested in learning 
the graph from as few cascades as possible. 

Note that multiple seeds begin each cascade $u \in U$; thus, for a single cascade even at time step 1 we
will not be able to say with surety which seed infected which individual. 

Correlation decay: Loosely speaking, random processes on graphs are said to have "correlation 
decay" if far away nodes have negligible effects. For our problem, this means that the cascade from each 
seed does not travel too far. Formally, all the results in this paper assume that there exists a number $\alpha > 0$
such that for every node i, the sum of all probabilities of incoming edges satisfies $\sum_k p_{ki} < 1 - \alpha$. The
following lemma clarifies what this assumption means for the infection times of a node.

Lemma 1. For any node i and time t, we have

$$P[T_i = t] \le (1 - \alpha)^{t-1} p_{init}$$

Thus, the probability $P[T_i < \infty]$ that a node is infected satisfies $p_{init} \le P[T_i < \infty] \le p_{init}/\alpha$. Also, the
average distance from a node to any seed that infected it is at most $1/\alpha$. We discuss the case where there is
no correlation decay in the Discussion section.
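
As a small numerical illustration of the correlation decay assumption, the sketch below computes the largest admissible decay margin for a given matrix of edge probabilities, together with the lower and upper bounds on the infection probability quoted above. The function names and the example matrix are hypothetical; since the assumption is a strict inequality, any $\alpha$ strictly below the returned value qualifies.

```python
def decay_coefficient(p):
    """Return 1 - max_i sum_k p[k][i], where p[k][i] is the probability of edge k -> i."""
    n = len(p)
    return 1.0 - max(sum(p[k][i] for k in range(n)) for i in range(n))


def infection_probability_bounds(p_init, alpha):
    """Lower and upper bounds on P[T_i < inf] as stated after Lemma 1."""
    return p_init, p_init / alpha


# Hypothetical 2-node graph: edge 0 -> 1 has probability 0.3, edge 1 -> 0 has 0.2.
p_example = [[0.0, 0.3], [0.2, 0.0]]
alpha = decay_coefficient(p_example)
low, high = infection_probability_bounds(0.05, alpha)
```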

Interpreting the results: Each cascade we observe provides some information about the graph. 
Suppose we want to infer the presence, or absence, of the directed edge (i, j) (i.e. if $p_{ij} > 0$ or not). Note
that if the parent i is not infected in a cascade, then that cascade provides no information about (i, j): 
since the parent was never infected, no infection attempt was made using that edge; the "edge activation 
variable" was never sampled. While our theorems are in terms of the total number m of cascades needed 
for graph estimation, for a meaningful interpretation of this number one needs to realize that the expected
number of times we get useful information about any edge is between $m p_{init}$ and $m p_{init}/\alpha$.
These are also the bounds on the average number of times a particular node is infected in a particular 
cascade. 

We provide both upper bounds (via two learning algorithms), and (information theoretic) lower bounds 
on the sample complexity. Note that the execution of our algorithms does not require knowledge of
parameters like $p_{init}$, $\alpha$ etc.; these are defined only for the analysis.

^2 For example, on social networks like Facebook or Twitter, we may know the set of all friends of a user, and from these we
want to find the ones that most influence the user.

3 Maximum Likelihood



The graph learning problem can be interpreted as a parameter estimation problem: for each cascade, the 
vector T of infection times is a set of random variables that has a joint distribution which is determined by 
a set of parameters $p_{ji} \ge 0$ for every i and $j \in S_i$. We want to find these parameters, or more specifically
the identities of the edges where they are non-zero, from samples $t^u$, $u \in U$. Each choice of parameters
has an associated probability, or likelihood, of generating the infection times we observe. The classical 
Maximum-likelihood (ML) estimator advocates picking the parameter values that maximize this likelihood. 

Our crucial insight in this section is that, with an appropriate change of variables, the likelihood
function has a particularly nice (decoupled, convex) form, enabling both efficient implementation and
analysis. In particular, define $\theta_{ij} := -\log(1 - p_{ij})$; note that $p_{ij} = 0 \Leftrightarrow \theta_{ij} = 0$.

Further, for each node i let $\theta_{*i} := \{\theta_{ji} : j \in S_i\}$ be the set of parameters corresponding to the possible
parents $S_i$ of node i. Let $\theta$ be the set of all parameters of the graph. Note that $\theta \ge 0$ (i.e. every parameter
is positive or zero). Finally, we define the log-likelihood of a vector t of samples to be

$$\mathcal{L}(t; \theta) := \log(P_\theta[T = t])$$

The proposition below shows how $\mathcal{L}$ decouples into convex functions with this change of variables.

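Proposition 1 (convexity & decoupling). For any vector of parameters $\theta$, and infection time vector t, the
log-likelihood is given by

$$\mathcal{L}(t; \theta) = \log\left(p_{init}^{s} (1 - p_{init})^{n-s}\right) + \sum_i \mathcal{L}_i(t_{S_i}; \theta_{*i})$$

where s is the number of seeds (i.e. nodes with $t_i = 0$), and the node-based term

$$\mathcal{L}_i(t_{S_i}; \theta_{*i}) := -\sum_{j : t_j \le t_i - 2} \theta_{ji} + \log\left(1 - \exp\left(-\sum_{j : t_j = t_i - 1} \theta_{ji}\right)\right)$$

Furthermore, $\mathcal{L}_i(t_{S_i}; \theta_{*i})$ is a concave function of $\theta_{*i}$, for any fixed $t_{S_i}$.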

Proof: Please see appendix. 

Remark: The overall log-likelihood $\mathcal{L}(t; \theta)$ has now decoupled because it is the sum of n terms of the
form $\mathcal{L}_i(t_{S_i}; \theta_{*i})$, each of which depends on a different set of variables $\theta_{*i}$. Thus each one can be optimized,
and analyzed, in isolation.
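
The node-based term can be evaluated directly from the infection times. The following Python sketch is one possible reading of the formula in Proposition 1 for a single cascade; it assumes the conventions that a seed contributes zero, and that for a node that is never infected only the failed-attempt sum remains. The function name and the dictionary `theta_i` (mapping each candidate parent j to $\theta_{ji}$) are hypothetical.

```python
import math


def node_log_likelihood(t, i, theta_i):
    """Node-based log-likelihood term of node i for one cascade.

    t is the list of infection times (math.inf = never infected);
    theta_i[j] >= 0 is the parameter of the candidate edge j -> i.
    """
    if t[i] == 0:
        return 0.0  # seeds are accounted for by the p_init term
    # Candidates that were active at least two steps before t_i (or at any
    # time, if i was never infected) tried to infect i and failed.
    failed = sum(th for j, th in theta_i.items()
                 if t[j] != math.inf and t[j] <= t[i] - 2)
    if t[i] == math.inf:
        return -failed
    # Candidates active at time t_i - 1: at least one of them succeeded.
    tried = sum(th for j, th in theta_i.items() if t[j] == t[i] - 1)
    if tried <= 0.0:
        return -math.inf  # infection of i cannot be explained by theta_i
    return -failed + math.log(1.0 - math.exp(-tried))


# Hypothetical check with p_02 = 0.5 and p_12 = 0.25: node 0 fails, then node 1 succeeds.
theta_2 = {0: -math.log(1 - 0.5), 1: -math.log(1 - 0.25)}
value = node_log_likelihood([0, 1, 2], 2, theta_2)
```

In the example, the value equals the logarithm of (1 - 0.5) times 0.25, i.e. the probability that the first parent fails and the second succeeds.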

The algorithmic implications of this proposition are: 
(a) if we are only interested in a small subset of nodes, we can find their parental neighborhood by solving
a separate $D_i$-variable convex program for each one,

(b) even if we want to find the entire graph, the decoupling allows for parallelization, and speedup: solving
n convex programs with n variables each is much faster than solving one program with $n^2$ variables.

(c) The function $\mathcal{L}_i$ is fully determined by the times $t_{S_i}$ of the node's super-neighborhood; it does not need
knowledge of the infection times of other nodes.

Proposition 1 is equally crucial analytically, as it enables us to derive bounds on the number of cascades
required for us to reliably select the neighborhood, via analysis of the first-order optimality conditions of
the convex program. In particular, we will see that complementary slackness conditions from convex
programming, and concentration results, are key to proving our results on the sample complexity of the
ML procedure.

The ML algorithm for finding the parental neighborhood of node i is formally stated below; it involves
solving the convex program corresponding to the max-likelihood, and setting small values of $\theta_{ji}$ to 0. The
threshold for this cut-off is $\eta$, which is an input to the procedure.

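Algorithm 1 ML Algorithm for Node i
1: Find the optimizer of the empirical likelihood, i.e. find

$$\hat{\theta}_{*i} := \arg\max_{\theta_{*i} \ge 0} \sum_{u \in U} \mathcal{L}_i(t^u_{S_i}; \theta_{*i})$$

where $\mathcal{L}_i(t_{S_i}; \theta_{*i})$ is as defined in Prop. 1.
2: Estimate the parental neighborhood by thresholding:

$$\hat{V}_i := \{j : \hat{\theta}_{ji} \ge \eta\}$$

3: Output $\hat{V}_i$.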
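
The two steps of Algorithm 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the convex program is solved here by projected gradient ascent with step halving, which is a generic choice and not a solver prescribed by the text; the function name `fit_node`, the starting point 0.5 and the iteration budget are all hypothetical.

```python
import math


def fit_node(cascades, i, candidates, eta, iters=500):
    """Sketch of Algorithm 1: maximize sum_u L_i over theta >= 0, then threshold."""
    # For every cascade, record which candidates failed and which may have succeeded.
    terms = []
    for t in cascades:
        if t[i] == 0:
            continue  # i was a seed: the cascade carries no information on its parents
        failed = [j for j in candidates if t[j] != math.inf and t[j] <= t[i] - 2]
        tried = [j for j in candidates if t[i] != math.inf and t[j] == t[i] - 1]
        terms.append((failed, tried))

    def objective(theta):
        total = 0.0
        for failed, tried in terms:
            total -= sum(theta[j] for j in failed)
            if tried:
                s = sum(theta[j] for j in tried)
                if s <= 0.0:
                    return -math.inf
                total += math.log(-math.expm1(-s))  # log(1 - exp(-s))
        return total

    def gradient(theta):
        g = {j: 0.0 for j in candidates}
        for failed, tried in terms:
            for j in failed:
                g[j] -= 1.0
            if tried:
                s = sum(theta[j] for j in tried)
                w = math.exp(-s) / (-math.expm1(-s))  # derivative of log(1 - exp(-s))
                for j in tried:
                    g[j] += w
        return g

    theta = {j: 0.5 for j in candidates}  # strictly positive start keeps the log finite
    value = objective(theta)
    for _ in range(iters):
        g = gradient(theta)
        step = 1.0
        while step > 1e-12:
            # Gradient step followed by projection onto theta >= 0.
            trial = {j: max(0.0, theta[j] + step * g[j]) for j in candidates}
            trial_value = objective(trial)
            if trial_value > value:
                theta, value = trial, trial_value
                break
            step *= 0.5
        else:
            break  # no improving step found: numerically converged
    return {j for j in candidates if theta[j] >= eta}, theta


inf = math.inf
# Hypothetical data for node 2 with candidate parents 0 and 1.
cascades = ([[0, inf, 1]] * 4 + [[0, inf, inf]] * 4
            + [[inf, 0, inf]] * 6 + [[0, 0, 1]] * 2)
estimate, theta_hat = fit_node(cascades, 2, [0, 1], eta=0.1)
```

In this toy data set node 1 never infects node 2 when it is the only active candidate, so its parameter is driven to zero and only node 0 survives the threshold.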



Our main analytical result of this section is a characterization of the performance of this ML algorithm, 
in terms of the number of cascades it needs to reliably estimate the parental neighborhood of any node i. 

Theorem 1. Consider a node i with true parental degree $d_i := |V_i|$, and super-graph degree $D_i := |S_i|$. Let
$p_{i,min} := \min_{j \in V_i} p_{ji}$ be the strength of the edge from the weakest parent. Assume $d_i p_{init} < \frac{1}{2}$. Then, for
any $\delta > 0$, if the number of cascades $m = |U|$ satisfies

$$m \ge \frac{c}{p_{init}} \left(\frac{1}{\alpha^7 \eta^2 p_{i,min}^2}\right) \left(d_i^2 \log \frac{D_i}{\delta}\right) \qquad (1)$$

then, with probability greater than $1 - \delta$, the estimate $\hat{V}_i$ from the ML algorithm with threshold $\eta$ will have

(a) no false neighbors, i.e. $\hat{V}_i \subseteq V_i$, and

(b) all strong enough neighbors: if $j \in V_i$ and $p_{ji} > \frac{8}{\alpha}(e^{2\eta} - 1)$, then $j \in \hat{V}_i$ as well.
Here c is a number independent of any other system parameter.



Remarks: 

(a) This is a non-asymptotic result that holds for all values of the system variables $d_i$, $p_{init}$, $\alpha$, $p_{i,min}$, $\eta$
and $\delta$. Appropriate asymptotic results can be derived as corollaries, if required. Note that this result on
finding the nodes that influence node i does not depend on n.

(b) We can learn the entire neighborhood, i.e. $\hat{V}_i = V_i$, by choosing the threshold $\eta < \frac{1}{2} \log\left(1 + \frac{\alpha p_{i,min}}{8}\right)$
low enough, and the corresponding number of cascade samples m according to (1). Thus, the number of
times node i needs to be infected before we can reliably (i.e. with a fixed small error probability) learn
its neighborhood scales as $O(d_i^2 \log D_i)$ (for fixed values of other system variables). Our result allows for
learning stronger edges with fewer samples.

(c) If we want to learn the structure of the entire graph with probability greater than $1 - \epsilon$, we can set
$\delta = \epsilon/n$ and then take a union bound over all the nodes. So, for example, if every node has true degree
at most $|V_i| \le d$, and super-graph degree $|S_i| \le D$, then the number of samples needed to learn the entire
graph (with probability at least $1 - \epsilon$) scales as $O(d^2 \log \frac{Dn}{\epsilon})$ (for fixed values of other system variables).

(d) The average number of parents of i that are seeds is $d_i p_{init}$. If this is large, then in every cascade
there will be a reasonable probability of one of them being a seed, and infecting i in the next time slot. This
makes it hard to discern the neighborhood of i; the (mild) assumption $d_i p_{init} < \frac{1}{2}$ is required to counter
this effect. Indeed, in most applications $p_{init}$ is likely to be quite small.

(e) Note that our results depend on the in-degree of nodes, not the out-degree. So for example it is 
possible to have high out-degree nodes (as e.g. in power-law graphs), and still be able to learn the graph 
with small number of samples. 



3.1 Generalized Independent Cascade Model 

In this paper, for ease of analysis, we restrict our sample-complexity analysis to one-step independent 
cascade epidemics, where a node is active for only one time slot after it is infected. However, our algorithmic 
and bounding approaches apply to a more general class of independent cascade models. Specifically, we 
consider an extension where each parent now has a probability distribution of the amount of time it waits 
before infecting a child, and prove a generalization of Proposition 1, which was the key result enabling
both the implementation and analysis of the ML algorithm. 

Formally, let $p_{ji}^{\tau}$ denote the probability that an active node j infects a susceptible child i, $\tau$ time steps
after j was infected. The time taken for j to infect i is bounded by a parameter $\bar{t}$, i.e., $p_{ji}^{\tau} = 0$ for $\tau > \bar{t}$.
Note that if we have $\bar{t} = 1$, we recover the standard independent cascade model. The total probability that
j infects i is given by $\sum_{\tau \in [\bar{t}]} p_{ji}^{\tau}$ (which can be strictly less than 1), where $[\bar{t}]$ denotes the set of integers
between 1 and $\bar{t}$ (including the end points).

Following in the steps of Proposition 1, define

$$\theta_{ji}^{\tau} := -\log\left(\frac{1 - \sum_{r \in [\tau]} p_{ji}^{r}}{1 - \sum_{r \in [\tau - 1]} p_{ji}^{r}}\right)$$

Note that given any parameter vector $p_{ji}$ we obtain the corresponding $\theta_{ji}$ and vice versa. Moreover $\theta_{ji}^{\tau} = 0 \Leftrightarrow p_{ji}^{\tau} = 0$. Suppose
each node is seeded with the infection with probability $p_{init}$ and let $\mathcal{L}(t; \theta)$ denote the log-likelihood of the
infection time vector t when the parameters of the model are given by $\theta$. We have the following version of
Proposition 1 for the generalized independent cascade model.

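Proposition 2. For any vector of parameters $\theta$, and infection time vector t, the log-likelihood is given by

$$\mathcal{L}(t; \theta) = \log\left(p_{init}^{s} (1 - p_{init})^{n-s}\right) + \sum_i \mathcal{L}_i(t_{S_i}; \theta_{*i})$$

where s is the number of seeds (i.e. nodes with $t_i = 0$), and the node-based term

$$\mathcal{L}_i(t_{S_i}; \theta_{*i}) := -\sum_{j : t_j \le t_i - 2} \; \sum_{\tau \in [t_i - t_j - 1]} \theta_{ji}^{\tau} + \log\left(1 - \exp\left(-\sum_{j : t_j \le t_i - 1} \theta_{ji}^{t_i - t_j}\right)\right)$$

Furthermore, $\mathcal{L}_i(t_{S_i}; \theta_{*i})$ is a concave function of $\theta_{*i}$, for any fixed $t_{S_i}$.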



Proof: Please see appendix. 
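
The change of variables for the generalized model can be illustrated with a short Python sketch. It reads the transformed parameter for delay $\tau$ as minus the log of the conditional probability that j does not infect i at delay $\tau$, given that it has not done so at any earlier delay; this reading, the function names, and the list layout (index k holds delay k + 1) are assumptions of the example. The total infection probability is assumed to be strictly less than 1 so that every logarithm is finite.

```python
import math


def delay_probs_to_theta(p_tau):
    """Map delay probabilities to the transformed parameters.

    p_tau[k] is the probability that j infects i exactly k + 1 steps after
    j was infected; sum(p_tau) is assumed to be strictly less than 1.
    """
    theta, cumulative = [], 0.0
    for p in p_tau:
        # Conditional survival: no infection at this delay given none earlier.
        theta.append(-math.log((1.0 - cumulative - p) / (1.0 - cumulative)))
        cumulative += p
    return theta


def theta_to_delay_probs(theta):
    """Inverse map, recovering the delay probabilities from the parameters."""
    p_tau, survive = [], 1.0
    for th in theta:
        p = survive * (1.0 - math.exp(-th))
        p_tau.append(p)
        survive -= p
    return p_tau


# Hypothetical delay distribution over three time steps.
theta_example = delay_probs_to_theta([0.2, 0.3, 0.1])
recovered = theta_to_delay_probs(theta_example)
```

With a single delay the map reduces to the one-step change of variables used in Proposition 1.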



4 Greedy Algorithm 



We now analyze the sample complexity of a simple iterative greedy algorithm - for the case when the graph
is a tree.^3 The algorithm is of course defined for general graphs.

The idea is as follows: suppose we want to find the parents of node i from a given set of cascades U. In
each cascade u, the set of nodes that could have possibly infected i is the set of nodes j for which $t_j^u = t_i^u - 1$.
In the first step, the algorithm thus picks the j which has $t_j^u = t_i^u - 1$ for the largest number of observed
cascades. It then removes those cascades from further consideration (since they have been "accounted for")
and proceeds as before on the remaining cascades, stopping when all cascades are exhausted.



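Algorithm 2 Greedy Algorithm for Node i
1: Initialize unaccounted cascades $\tilde{U} = U$
2: Initialize $\hat{V}_i = \emptyset$
3: while $\tilde{U} \neq \emptyset$ do
4: Find $k = \arg\max_{j \in S_i} |\{u \in \tilde{U} : t_j^u = t_i^u - 1\}|$
5: Add it: $\hat{V}_i \leftarrow \hat{V}_i \cup \{k\}$
6: Remove cascades: $\tilde{U} \leftarrow \tilde{U} \setminus \{u : t_k^u = t_i^u - 1\}$
7: end while
8: Output $\hat{V}_i$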
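
A minimal Python sketch of the greedy procedure is given below. One detail is an assumption of this sketch rather than part of the listing: cascades in which node i is a seed, is never infected, or has no candidate infected one step earlier are dropped at the start, since no candidate can account for them and the loop would otherwise not terminate. The function name `greedy_parents` is hypothetical.

```python
import math


def greedy_parents(cascades, i, candidates):
    """Greedily pick candidates that explain the most unaccounted infections of i."""
    # Keep only cascades in which some candidate could have infected i.
    remaining = [t for t in cascades
                 if 0 < t[i] < math.inf
                 and any(t[j] == t[i] - 1 for j in candidates)]
    parents = set()
    while remaining:
        # Candidate infected exactly one step before i in the most remaining cascades.
        k = max(candidates,
                key=lambda j: sum(t[j] == t[i] - 1 for t in remaining))
        parents.add(k)
        # Those cascades are now accounted for.
        remaining = [t for t in remaining if t[k] != t[i] - 1]
    return parents


inf = math.inf
# Hypothetical data: node 2 has true parents 0 and 1; node 3 is a bystander.
cascades = ([[0, inf, 1, inf]] * 3 + [[inf, 0, 1, inf]] * 2
            + [[0, inf, 1, 0]] + [[0, 0, inf, inf]])
found = greedy_parents(cascades, 2, [0, 1, 3])
```

Every remaining cascade contains at least one candidate infected one step before i, so each iteration removes at least one cascade and the loop ends after at most as many rounds as there are candidates.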



Our main result for this section is below. 

Theorem 2. Suppose the graph G is a tree, and the degree of node i is $d_i := |V_i|$. Suppose also that
Pinit < "'iQed". If Algorithm 2 is given a super-neighborhood of size $D_i := |S_i|$, then for any $\delta > 0$, if the
number of samples satisfies

$$m \ge \frac{c}{p_{init}} \left(\frac{1}{p_{min}}\right) d_i \log \frac{D_i}{\delta}$$

then with probability at least $1 - \delta$ the estimate from the greedy algorithm will be the same as the true
neighborhood, i.e. $\hat{V}_i = V_i$. Here c is a constant independent of any other system parameter.



5 Lower Bounds 



We now turn our attention to establishing lower bounds on the number of cascades that need to be observed
for even approximately learning graph structure, using any algorithm. Clearly, we now cannot focus on
learning just one graph, since in that case we could come up with an "algorithm" tailored to find precisely
that one graph. Instead, as is standard practice in information-theoretic lower bounds, we need to consider
a collection (or "ensemble") of graphs, and study how many cascades are needed to (approximately) find
any one graph from this collection.

^3 We believe (especially since we have correlation decay) that our results can be easily extended to the case of "locally
tree-like" graphs; e.g. random graphs from the Erdős-Rényi, random regular or several other popular models.

We first state a lower bound in a general setting, for any pre-defined ensemble and notion of approxi- 
mate recovery. We then provide two corollaries specializing it to our independent cascade epidemic model, 
edit distance approximation, and two natural graph ensembles. 

General Setting: Consider any general cascading process generating infection times $\{T_i\}$. Let $\mathcal{G}$ be
a fixed collection of graphs and corresponding edge probabilities, and let G be a graph chosen uniformly
at random from this collection. We then generate a set U, with $|U| = m$, of independent cascades, and
observe infection times $T^U$. Let $\hat{G}(T^U)$ be a graph estimator that takes the observations as an input and
outputs a graph. Finally, we say that a graph G' approximately recovers graph G if $G \in B(G')$, where
$B(G') \subseteq \mathcal{G}$ is any pre-defined set of graphs, with one such set defined for every G'.

So for example, if we are interested in exact recovery, we would have $B(G') = \{G'\}$, i.e. the singleton.
If we were interested in edit distance of s, we would have $B(G')$ be the set of all graphs within edit distance
s of G'.

We define the probability of error of a graph estimator $\hat{G}(\cdot)$ to be

$$P_e(\hat{G}) := P[G \notin B(\hat{G}(T^U))]$$

where the probability is calculated over the randomness in the choice of G itself, and the generation of
infection times in this G. Note that an error is defined to occur when approximate recovery (as
defined by the sets B) fails.

Theorem 3. In the general setting above, for any graph estimator to have a probability of error of $P_e$, we
need

$$m \ge \frac{(1 - P_e) \log \frac{|\mathcal{G}|}{\sup_{G} |B(G)|} - 1}{\sum_i H(T_i)}$$

where $H(\cdot)$ is the entropy function.

Proof. To shorten notation, we will denote $\hat{G}(T^U)$ simply by $\hat{G}$. The proof uses several basic information-theoretic
inequalities, which can be found e.g. in [5]. In the following $H(\cdot)$ denotes entropy and $I(\cdot\,;\cdot)$
denotes mutual information.

We can see that the following diagram forms a Markov chain

$$G \to T^U \to \hat{G}$$

We have the following series of inequalities:

$$\begin{aligned}
H(G) &= I(G; \hat{G}) + H(G \mid \hat{G}) \\
&\overset{(\zeta_1)}{\le} I(G; T^U) + H(G \mid \hat{G}) \\
&\overset{(\zeta_2)}{\le} H(T^U) + H(G \mid \hat{G}) \\
&\overset{(\zeta_3)}{\le} m H(T) + H(G \mid \hat{G}) \\
&\overset{(\zeta_4)}{\le} m \sum_i H(T_i) + H(G \mid \hat{G})
\end{aligned}$$

where $(\zeta_1)$ follows from the data processing inequality, $(\zeta_2)$ follows from the fact that the mutual information
between two random variables is at most the entropy of either of them, and $(\zeta_3)$ and $(\zeta_4)$ follow from the
subadditivity of entropy. Since G is sampled uniformly at random from $\mathcal{G}$ we have that $H(G) = \log |\mathcal{G}|$.
We now use Fano's inequality to bound $H(G \mid \hat{G})$.

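$$\begin{aligned}
H(G \mid \hat{G}) &\overset{(\zeta_1)}{\le} H(G, \mathrm{Err} \mid \hat{G}) \\
&\overset{(\zeta_2)}{\le} H(\mathrm{Err} \mid \hat{G}) + H(G \mid \mathrm{Err}, \hat{G}) \\
&\overset{(\zeta_3)}{\le} H(\mathrm{Err}) + H(G \mid \mathrm{Err}, \hat{G}) \\
&\overset{(\zeta_4)}{\le} 1 + P_e \log |\mathcal{G}| + (1 - P_e) \log \sup_{G} |B(G)|
\end{aligned}$$

where Err is the error indicator random variable (i.e., is 1 if $G \notin B(\hat{G})$ and 0 otherwise), so that $P_e = E[\mathrm{Err}]$.
$(\zeta_1)$ follows from the monotonicity of entropy, $(\zeta_2)$ follows from the chain rule of entropy, $(\zeta_3)$ follows
from the monotonicity of entropy with respect to conditioning and $(\zeta_4)$ follows from Fano's inequality.
Combining the above two results, we obtain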

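$$m \sum_i H(T_i) \ge (1 - P_e) \log \frac{|\mathcal{G}|}{\sup_{G} |B(G)|} - 1$$

$$\Rightarrow \quad m \ge \frac{1}{\sum_i H(T_i)} \left((1 - P_e) \log \frac{|\mathcal{G}|}{\sup_{G} |B(G)|} - 1\right)$$

□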
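
The final inequality of the proof is easy to evaluate numerically. The sketch below is a hedged illustration: it takes the logarithm of the ensemble size, the logarithm of the largest approximation set, the per-node entropies and the target error probability, all in bits (the additive constant 1 comes from bounding the entropy of the error indicator by one bit). The function name and the example ensemble are hypothetical.

```python
import math


def cascade_lower_bound(log_ensemble_size, log_max_ball_size, entropies, p_err):
    """Evaluate ((1 - P_e) * log(|G| / sup|B|) - 1) / sum_i H(T_i), floored at 0.

    All logarithms and entropies are assumed to be in bits.
    """
    numerator = (1.0 - p_err) * (log_ensemble_size - log_max_ball_size) - 1.0
    return max(0.0, numerator / sum(entropies))


# Hypothetical ensemble: each of n nodes picks exactly d parents among the
# other n - 1 nodes; exact recovery (ball of size 1); assumed 0.5 bits of
# entropy per infection time.
n, d = 100, 3
log_size = n * math.log2(math.comb(n - 1, d))
bound = cascade_lower_bound(log_size, 0.0, [0.5] * n, p_err=0.1)
```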

To apply this result to a particular ensemble $\mathcal{G}$ and notion of approximation B, we need to find a lower
bound on $|\mathcal{G}|$, and upper bounds on $|B(G')|$ for all G' and $H(T_i)$ for all i. The following lemma states an
upper bound on $H(T_i)$ for our independent cascade model when we have correlation decay coefficient $\alpha$.
Both our corollaries assume this is the case for all graphs in their respective ensembles.



Lemma 2. For any graph with correlation decay coefficient $\alpha$, for any node i, and when Pinit < h, we have that

2 



Hm) < ^ log^ + f^i log 



1- a \ pinit \ a J I - a 

Pinit\ , /-. Pinit 



1 



-) log (] 



a / \ a 

=: $p_{init} H(\alpha, p_{init})$

Note that the edit distance between two graphs is the number of edges present in only one of the two 
graphs but not the other (i.e. the number of edges in the symmetric difference of the two graphs). Our 
first corollary is for the case when there is no super-graph information, and we want to approximate in 
global edit distance. 
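As a small illustration of this definition, the edit distance can be computed as the size of the symmetric difference of the two edge sets. This is only a sketch; the function name is hypothetical, and directed edges are assumed to be represented as `(source, target)` tuples.

```python
def edit_distance(edges_a, edges_b):
    """Number of edges present in exactly one of the two graphs.

    Each graph is given as an iterable of directed edges (source, target).
    """
    return len(set(edges_a) ^ set(edges_b))
```

For example, the graphs with edge sets `{(1, 2), (2, 3)}` and `{(2, 3), (3, 1)}` differ in the two edges `(1, 2)` and `(3, 1)`.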

Corollary 1. Let Qd denote the set of all graphs with in-degrees bounded by d, and B-y^G') be the set of all 
graphs within edit distance 7 of G' . Let pinit < \ ■ Then for any algorithm to have a probability of error of 
Pe, we need 

(l-Pe) 1-a / , n 7, n^X ^ 

m > ^= dlog ^log— -1 

Pinit H [a, Pinit) V d n J 



Proof. We have that 



log l^rfl = log ( ^ j = (1 + 0(1)) ndlog- 

2 



log|i3^(G')| <log(^2)^ <7log 



n 

7 

Using the above two equations along with Theorem [3] and Lemma [2] gives us the result. □ 



Note that the number of times a node is infected thus needs to be $\Omega((d - \frac{\gamma}{n}) \log n)$ (since it is of the
same order as $m\,p_{\text{init}}$). For exact recovery, i.e. $\gamma = 0$, we see that our result on the performance of our ML
algorithm - specialized to the no prior information case $D = n$ - is off by just a factor $d$ in terms of the
number of samples required.

The second corollary is for the case when we do have prior supergraph information. In particular, we
assume that we are given sets $S_i$, of size $|S_i| = D$, for each node $i$. We consider the ensemble $\mathcal{G}_{D,d}$ of all
in-degree-$d$ subgraphs of this fixed supergraph. Thus for each node, we need to learn the $d$ parents it has,
from a given super-set of size $D$. Finally, for each node $i$ we allow $s_i$ errors; let $B_s(G')$ be the corresponding
set of all subgraphs of the given supergraph.

Corollary 2. For any estimator to have a probability of error of Pe in the setting above, the number of 
samples m must be bigger than 

(1-Pe) 1-a D 1 V- , eD , , ^ 

=; dlog— y Silog hlogmax(si,l) - 1 

Pinit H {a, Pinit) \ d Si j 

Remark: Specializing this result to exact recovery (i.e. $s_i = 0$) removes the dependence on $n$, and again
shows us that the ML algorithm is within a factor $d$ of optimal for the case when we have a super-graph.

Proof. We have the following bound on the size of the ensemble:



Similarly, 



where 



log I I = log 



[l + o{l))ndlog 



D 



log\Bs{G)\<logllij2 



iev \i=o 



D 



< log (max(l, Si) I 
iev ^ 

- X] ^max(l, Si) (^^^ ^ 



log max(l, Si) + ^ Si log 



De 



iev 



iev 



Bs{G) = {Gegd■■V^AV^<S^'^i£V} 



(3) 



Note that in the second inequality we assume Si < ^ because otherwise if d < we can choose Vj = 
and if d > ^ , we can choose Vi = Vj . Using Theorem [s] ([s]) and Lemma [2] gives us the first part of the 
result. □ 



6 General SIR Epidemics: Markov Graphs and Causality 

In this section we consider a much more general model for SIR epidemics/cascades on a directed graph, 
and establish a connection to the classic formalism of Markov Random Fields (MRFs) - see e.g. [12] for
a formal introduction. Specifically, we show that the (undirected) Markov graph of infection times of an 
SIR epidemic is obtained via the moralization of the true (directed) network graph on which the cascade 
spreads. A moralized graph, as defined below, is obtained by adding edges between all parents of a node 
(i.e. "marrying" them), and removing all directions from all edges. Graph moralization also arises in 
Bayesian networks, and we comment on the relationship, and the role of causality, after we present our 
result. 

We first briefly describe our general model for SIR epidemics, then define its Markov graph, and finally 
present our result. 

General SIR epidemics: We now describe a general model for SIR epidemics propagating on a
directed graph. Nodes can be in one of three states: 0 for susceptible, 1 for infected and active, and 2 for
resistant and inactive; we restrict our attention to discrete time in this paper. Let $X_i(t)$ be the state of
node $i$ at time $t$, and $X(t)$ be the vector corresponding to the states of all nodes. We require that this
process be causal, and governed by the true directed graph $G$, in the sense that for any time step $t$,

$$
P[X(t) = x(t) \mid x(0:t-1)] = \prod_{i} P[X_i(t) = x_i(t) \mid x_{V_i}(0:t-1)] \qquad (4)
$$
where the notation $x(1:t) = \{x(s), 1 \le s \le t\}$ is the entire history up to time $t$, and as before $V_i$ is the set
of parents of node $i$, and includes $i \in V_i$ as well. Note that the above encodes that the probability distribution of
each node's next state depends only on the history of itself and its neighbors, but is otherwise independent
of the history or current state of the other nodes. We assume that the cascade is initially seeded arbitrarily,
i.e. $x(0)$ can be any fixed initial condition.

For each node $i$, let $T_i^{(1)}$ be the (random) time when its state transitioned from 0 to 1, and $T_i^{(2)}$
the time when it transitioned from 1 to 2 (of course, if neither happened then we can take them to be $\infty$). Let $T_i = (T_i^{(1)}, T_i^{(2)})$
be the summary for node $i$'s participation in the cascade.
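To make the summary $T_i = (T_i^{(1)}, T_i^{(2)})$ concrete, the sketch below simulates one special case of the general model: the one-step independent cascade, in which a node is active for exactly one time step and each active node $j$ infects each susceptible child $i$ independently. The function name, the `edge_prob` dictionary and the explicit `seeds` argument are illustrative assumptions, not part of the model definition above.

```python
import math
import random


def simulate_cascade(nodes, edge_prob, seeds, rng):
    """Simulate a one-step independent cascade (an assumed special case).

    nodes     : iterable of node labels
    edge_prob : dict mapping directed edge (j, i) -> probability that an
                active j infects a susceptible i in the next time step
    seeds     : nodes that are infected (state 1) at time 0
    rng       : random.Random instance
    Returns (t_infect, t_recover): time of the 0->1 transition and of the
    1->2 transition for each node, math.inf if it never happened.
    """
    t_infect = {v: math.inf for v in nodes}
    t_recover = {v: math.inf for v in nodes}
    active = set(seeds)
    for v in active:
        t_infect[v] = 0
    t = 0
    while active:
        t += 1
        newly_infected = set()
        for (j, i), p in edge_prob.items():
            # Only currently active parents can infect susceptible children.
            if j in active and t_infect[i] == math.inf and rng.random() < p:
                newly_infected.add(i)
        for j in active:
            t_recover[j] = t  # active for exactly one step, then resistant
        for i in newly_infected:
            t_infect[i] = t
        active = newly_infected
    return t_infect, t_recover
```

Each call produces one cascade; repeated calls with independently chosen seed sets give the samples from which the graph is to be learnt.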

Markov Graphs: Markov random fields (MRFs, also known as Graphical Models) are a classic 
formalism, enabling the use of graph algorithms for tasks in statistics, physics and machine learning. 
The central notion therein is that of the Markov graph of a probability distribution; in particular, every 
collection of random variables has an associated graph. Every variable is a node in the graph, and the 
edges encode conditional independence: conditioned on the neighbors, the variable is independent of all 
the other variables. For our purposes here, the random variables are the $T := \{T_i, i \in V\}$. We say that an
undirected graph $G'$ is the Markov graph of the variables $T$ if their joint probability distribution, for all $t$,
factors as follows

$$
P[T = t] = \prod_{c \in \mathcal{C}} f_c(t_c)
$$

for some functions $f_c$; here $\mathcal{C}$ is the set of cliques of $G'$, and for a clique $c \in \mathcal{C}$, $t_c := \{t_i, i \in c\}$ is the
vector of node times for nodes in $c$.

We need one more definition before we state our result. 

Moralization: Given a directed graph $G$, its moralized graph $\overline{G}$ is the undirected graph where two
nodes are connected if and only if they either have a parent-child relationship in $G$, or have a
common child, or both. Formally, the undirected edge $(i, j)$ is present in $\overline{G}$ if and only if at least one of the
following is true:

(a) directed edge $(i, j)$ or $(j, i)$ is present in $G$, or

(b) there is some node $k$ such that $(i, k)$ and $(j, k)$ are present in $G$ (i.e. $k$ is a common child).

Figure 2 illustrates the process of moralization with an example.




(a) Directed graph $G$    (b) Moralized graph $\overline{G}$ of $G$

Figure 2: An example of moralization
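The definition of moralization translates directly into a few lines of code. The sketch below is illustrative only: the function name is hypothetical, directed edges are assumed to be `(parent, child)` tuples, and undirected edges are returned as `frozenset` pairs.

```python
from itertools import combinations


def moralize(nodes, directed_edges):
    """Return the undirected edge set of the moralized graph.

    nodes          : iterable of node labels
    directed_edges : iterable of (parent, child) tuples
    """
    parents = {v: set() for v in nodes}
    undirected = set()
    for parent, child in directed_edges:
        parents[child].add(parent)
        if parent != child:
            # Rule (a): keep every parent-child edge, dropping its direction.
            undirected.add(frozenset((parent, child)))
    for child in parents:
        # Rule (b): "marry" every pair of distinct parents of a common child.
        for a, b in combinations(parents[child] - {child}, 2):
            undirected.add(frozenset((a, b)))
    return undirected
```

For the directed graph with edges $1 \to 3$ and $2 \to 3$, the moralized graph contains the edges $\{1,3\}$, $\{2,3\}$ and the added "marriage" edge $\{1,2\}$.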
Theorem 4. Suppose infection times $T$ are generated from a general SIR epidemic, as above, propagating
on a directed network graph $G$. Let $\overline{G}$ be the (undirected) moralized graph of $G$. Then, $\overline{G}$ is the Markov
graph of $T$.

Remarks: The main appeal of this result arises from the generality of the model; indeed, it may be
possible to learn the moralized graph even when we do not know the precise epidemic evolution
model, as long as it satisfies (4). In particular, related to the focus of this paper, there has been
substantial work on learning the Markov graph structure of random variables from samples. In our setting,
each cascade is a sample from the joint distribution of $T$, and hence one can imagine using some of these
techniques. Markov graph learning techniques can generally be divided into

(a) those that assume a specific class of probability distributions: see e.g. [15], [21] for Gaussian MRFs,
[20], [2] for Ising models, [8] for general discrete pairwise distributions. These typically require knowledge of
the precise parametric form of the dependence, but then enable learning with a smaller number of samples.

(b) distribution-free algorithms, usually for discrete distributions and based on conditional independence
tests [H m [I?]. These do not need to know the parametric form a priori, but typically have higher
computational and sample complexity.

Causality: It is interesting to contrast Theorem 4 with the other results in this paper. In particular,
on the one hand, Theorems 1 and 2 utilize the precise causal process that generates $T$ to find the exact
true directed network graph. On the other hand, applying a Markov graph learning technique directly to 
the samples of T, without leveraging the process that generated them, only allows us to get to the moralized 
graph. It thus serves as a motivating example to extend the study of graph learning from samples to causal 
phenomena, in a way that explicitly takes into account time dynamics. 

Moralization also arises in Bayesian networks; this is an alternative formulation that associates an 
acyclic directed graph with a probability distribution. In that setting, the undirected Markov graph is also 
the moralization of this directed graph. We note however that our original true network graph G can have 
directed cycles; in our setting the moralization arises from (ignoring the) causality in time. 

7 Experiments 

As an initial empirical illustration of our results, in this section we present - via Figures 3, 4, 5 and 6 -
empirical evaluations of both the ML and Greedy algorithms on synthetic graphs, and sub-graphs of the
Twitter graph. In all cases, for the ML algorithm the threshold $\eta$ was picked via cross-validation.

8 Summary and Discussion 

This paper studies the problem of learning the graph on which epidemic cascades spread, given only the 
times when nodes get infected, and possibly a super-graph. We studied the sample complexity - i.e. the 
number of cascade samples required - for two natural algorithms for graph recovery, and also established 
a corresponding information-theoretic lower bound. To our knowledge, this is the first paper to study 
the sample complexity of learning graphs of epidemic cascades. Several extensions suggest themselves; we 
discuss some below. 
[Figure 3 plots; axis labels: Total number of infections (left), Average number of infections per node (right)]

Figure 3: Interpreting sample complexity: As mentioned in Section 2 and reinforced by our theorems,
consistent structure recovery is governed not so much by the total number of infections $m$ in the network,
as by the number of times a node is infected (which is approx. $p_{\text{init}}\, m$). This figure provides some empirical
validation of this claim; the plots on the left and right are from the same set of experiments using the ML
algorithm (and no super-graph information). On the left we plot the probability of successful recovery of
the entire graph as a function of $m$, while on the right we plot it as a function of the average number of
times a node was infected, for several different sizes of 2-d grids. On the left, we see that the total number
of cascades varies noticeably with grid size, but the average number of infections does not. This squares
with Theorem 1 since in all these graphs the $d$ is the same, and $\log n$ does not vary much either.

Observation Model: In this paper it is assumed that we have access to the times when nodes get 
infected. However, this may not always be possible. Indeed a weaker assumption is to only know the 
infected set in each cascade. To us this seems like a much harder problem, e.g. it is now not clear that 
there is a decoupling of the global graph learning problem. 

Decoupling: A key step in our ML results is to show that the global graph finding problem decouples 
into n local problems. Our proof of this fact can be extended to any causal network process - i.e. any 
process where the state $X_i(t)$ depends only on $x_{V_i}(t-1)$ - under the assumption that we can reconstruct
the entire process trajectory from our observations (so e.g. the weaker observation model above would not 
fall into this class). In particular, it holds for more general models of epidemic cascade propagation as 
well; we focused on the discrete-time one-step model as a first step. 

Correlation decay: Our results are for the case of correlation decay, i.e. when the cascade from one 
seed reaches a constant depth of nodes before extinguishing. Equally interesting and relevant is the case 
without correlation decay, when the cascade from each seed can reach as much as a constant fraction of 
the network. We suspect, based on experiments, that our algorithms would be efficient in this case as well; 
however, a proof would be technically quite different, and interesting. 

Greedy algorithms: As can be seen in our experiments, the greedy algorithm performs quite well 
even when the graph is far from being a tree (i.e. has several small cycles). It would be interesting to 
develop an alternate and more general proof of the performance of the greedy algorithm. We also note that 
one can easily formulate greedy algorithms in more general epidemic settings; this would involve iteratively 
choosing the parameter that gives the biggest change in the corresponding likelihood function. 
Figure 4: Effect of super-graph information: The presence of super-graph information can reduce the
number of node infections (and hence cascades) required to learn the graph. Here we plot the probability
of successful recovery for a 200-node random 4-regular graph, for the ML and Greedy algorithms, for two
scenarios: when we are given a super-graph of regular degree 8 that contains the true graph, and when we
are not given such information. We can see that the extent of reduction in sample complexity is moderate,
reflecting the fact that the effect of super-graph information is logarithmic (i.e. $\log D$ vs $\log n$).
Figure 6: Dependence on degree for Twitter graph: This figure is a scatter plot where for each node
we plot its degree, and the number of times it was infected before ML or Greedy succeeded in finding its
neighborhood. The graph is a 300-node graph extracted from Twitter (with edges made as explained
in Figure 5); now however this is treated as the true graph to be learnt, and the algorithm is given no
A Correlation decay

Proof of Lemma 1. We establish this by induction on the number of nodes $n$ in the graph. If $n = 1$,
the statement is obvious. Suppose the statement is true for all graphs which have up to $n - 1$
nodes. Consider now a graph $G$ that has $n$ nodes. Consider any node $i$. The statement of the lemma
is clearly true for $t = 1$. For $t > 1$, consider the probability that $i$ is infected by a parent $k \in S_i$ at time
step $t$. This can be upper bounded as follows:

$$
P_G[k \text{ infects } i \text{ at time } t] \le P_{\tilde{G}}[T_k = t - 1]\, p_{ki} \le (1 - \alpha)^{t-2}\, p_{\text{init}}\, p_{ki}
$$

where $\tilde{G} := G \setminus i$ is the graph without node $i$, $P_G$ denotes the probability when the graph is $G$, and similarly
for $P_{\tilde{G}}$. The second inequality follows from the induction assumption, and the fact that if $\alpha$ is the decay
coefficient for $G$, it is also a decay coefficient for $\tilde{G}$. Taking a union bound over $k \in S_i$ now gives us the statement of the
lemma for $G$:

$$
\begin{aligned}
P_G[T_i = t] &\le \sum_{k \in S_i} P_G[k \text{ infects } i \text{ at time } t] \\
&\le (1 - \alpha)^{t-2}\, p_{\text{init}} \sum_{k \in S_i} p_{ki} \\
&\le (1 - \alpha)^{t-1}\, p_{\text{init}}
\end{aligned}
$$

The bounds on $P[T_i < \infty]$ follow simply from summing this geometric series. □
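The geometric-series step can be checked numerically. The sketch below assumes the per-time bound $P[T_i = t] \le (1-\alpha)^{t-1} p_{\text{init}}$ for $t \ge 1$ and compares the finite sum with the closed form $p_{\text{init}}/\alpha$; the function name and the `horizon` argument are hypothetical.

```python
def infection_probability_bound(p_init, alpha, horizon):
    """Sum the per-time bounds (1 - alpha)**(t - 1) * p_init for t = 1..horizon.

    Returns (partial_sum, closed_form), where closed_form = p_init / alpha is
    the limit of the geometric series as horizon -> infinity.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    partial_sum = sum((1.0 - alpha) ** (t - 1) * p_init
                      for t in range(1, horizon + 1))
    return partial_sum, p_init / alpha
```

The partial sum never exceeds `p_init / alpha`, which is the form in which the bound on $P[T_i < \infty]$ is used in the later proofs.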

B Maximum Likelihood
B.1 Proof of Proposition 1

Let $X_i(\tau) = 0$ if $i$ is susceptible at time $\tau$, 1 if $i$ is active at time $\tau$ and 2 if $i$ is inactive at time $\tau$. Let
$X(\tau)$, $\tau = 0, \dots, n$ be the corresponding vector process. Note that $X(\tau)$ is a Markov process, and there
is a one-to-one correspondence between the set of infection times $t$ and the sample path $x(\tau)$ of the process
$X(\tau)$.

Given $t$, let $x^0(\tau)$ be the corresponding vector process. In particular,



Then, 
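because each node gets infected independently from each of its currently active neighbors. Thus we have
that

$$
P_\theta[T = t] = p_{\text{init}}^{s} (1 - p_{\text{init}})^{n-s} \prod_{i \in V} \left( \prod_{\tau = 1}^{n} a_i(\tau) \right) \qquad (5)
$$

where $s$ denotes the number of nodes infected at time 0, and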

where $a_i(\tau) := P_\theta[X_i(\tau) = x_i^0(\tau) \mid X(\tau - 1) = x^0(\tau - 1)]$. It is clear that for $\tau > t_i$, $a_i(\tau) = 1$. For $\tau = t_i$,
$a_i(\tau)$ is the probability that at least one of the nodes active at time $t_i - 1$ infected node $i$. Thus,

$$
a_i(t_i) = 1 - \prod_{j : t_j = t_i - 1} \exp(-\theta_{ji}) \qquad (6)
$$

Finally, for each $\tau < t_i$, $a_i(\tau)$ is the probability that active nodes at time $\tau - 1$ failed to infect node $i$. The
set of all nodes that were active but failed to infect susceptible node $i$ is $\{j : t_j \le t_i - 2\}$. So we have

$$
\prod_{\tau < t_i} a_i(\tau) = \prod_{j : t_j \le t_i - 2} \exp(-\theta_{ji}) \qquad (7)
$$

Putting (5), (6) and (7) together and taking log gives the result.

Concavity follows from the fact that $\log(1 - \exp(-x))$ is a concave function of $x$, and the fact that if
any function $f(x)$ is a concave function of $x$ then $f(\sum_i \theta_i)$ is jointly concave in $\theta$. □
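Equations (6) and (7) give the part of the log-likelihood that depends on the parameters of a single node $i$: a linear penalty for every parent that was active at least two steps before $i$ was infected, plus a log term for the parents active one step before. The following is a minimal sketch of that per-node term; the function name and the dictionary-based representation are assumptions made for illustration.

```python
import math


def node_log_likelihood(t_i, times, theta):
    """Per-node log-likelihood term implied by (6) and (7).

    t_i   : infection time of node i (0 for a seed, math.inf if never infected)
    times : dict j -> infection time t_j of candidate parent j
    theta : dict j -> theta_ji >= 0, with p_ji = 1 - exp(-theta_ji)
    The seeding probability is not included here.
    """
    # Parents that were active and failed to infect i: t_j <= t_i - 2.
    failed = sum(theta[j] for j in theta
                 if times[j] < math.inf and times[j] <= t_i - 2)
    log_lik = -float(failed)
    if 0 < t_i < math.inf:
        # Parents active at time t_i - 1; at least one of them infected i.
        s = sum(theta[j] for j in theta if times[j] == t_i - 1)
        log_lik += math.log(-math.expm1(-s)) if s > 0 else -math.inf
    return log_lik
```

Since the term is concave in `theta`, a generic convex solver (with the constraint `theta >= 0`) can be used to maximize its sum over cascades for each node separately.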
B.2 Proof of Theorem 1

We focus on the recovery of the neighborhood of node $i$. For brevity, we will drop $i$ from subscripts; thus
we write $\theta$ for the parameter vector of node $i$, $V$ for $V_i$, $S$ for $S_i$, and $d, D$ for $d_i, D_i$. Let $\theta^*$ be the true parameter values. Define
the empirical log-likelihood function by

$$
\hat{L}(\theta) := \frac{1}{m} \sum_{u \in U} \mathcal{L}_i(t^u; \theta)
$$

Note that the ML algorithm finds $\hat{\theta} = \arg\max_\theta \hat{L}(\theta)$. Also let $L(\theta) := \mathbb{E}_{\theta^*}[\mathcal{L}_i(T; \theta)]$.

Idea: Note that as the number of samples $m$ increases, $\hat{L} \to L$. Also, we know that $\theta^* = \arg\max_\theta L(\theta)$;
this is just stating that the expected value of the likelihood function is maximized by the true parameter
values, a simple classical result from ML estimation [19]. Thus when $\hat{L} \approx L$, their maximizers will also
be close; i.e. $\theta^* \approx \hat{\theta}$. However, they will not be exactly equal; hence the hope is to have subsequent
thresholding find the significant edges. The challenge is in establishing non-asymptotic bounds that show
that $m$ scales much slower than $n$ (the network size) or $D$ (the size of the super-neighborhood).

Roadmap to the proof: 

(a) In Proposition 3 we provide an expression for the gradient $\nabla_j L(\theta^*)$ of the expected log-likelihood
evaluated at the true parameters $\theta^*$. This can be used to show that $\nabla_j L(\theta^*) = 0$ for the true neighbors
$j \in V$, and that $\nabla_j L(\theta^*) < 0$ for $j \notin V$.

(b) Note that if we had similar relationships hold for the empirical likelihood, i.e. if $\nabla_j \hat{L}(\hat{\theta}) = 0$ for
$j \in V$ and $\nabla_j \hat{L}(\hat{\theta}) < 0$ for $j \notin V$, then we would be done; this is because by complementary slackness
conditions we would have that $\hat{\theta}_j > 0$ for $j \in V$ and $\hat{\theta}_j = 0$ otherwise: the non-zero $\hat{\theta}_j$ would then
correspond to the true neighborhood. Of course, these relationships do not hold exactly; the rest of the
proof shows that they hold approximately, and that the neighborhood can be found by thresholding.

(c) As a first step to analyzing $\nabla_j \hat{L}(\hat{\theta})$, in Lemma 3 we establish concentration results showing that
an intermediate quantity $\nabla_j \hat{L}(\theta^*)$ is close to $\nabla_j L(\theta^*)$, and hence we can show that $|\nabla_j \hat{L}(\theta^*)| < a$ for
$j \in V$ (i.e. the gradient is small for the true neighbors), and $\nabla_j \hat{L}(\theta^*) < -b$ for $j \notin V$ (i.e. the gradient is
negative for the others). Here $a$ and $b$ depend on the system parameters, and $a$ depends on the threshold
$\eta$ as well, with $a \to 0$ as $\eta \to 0$. This latter dependence is important as it shows that once the number of
samples $m$ becomes large, we can choose $\eta$ small and get exact recovery.

(d) In Lemma 4 we provide an upper bound on the value of $\hat{\theta}_j$ for $j \in V$. We need this to not be too
large for the next step.

(e) In Lemma 5 we derive an upper bound on the total value $\sum_{j \notin V} \hat{\theta}_j$ of the non-neighbor parameters
in $\hat{\theta}$. This upper bound implies that no non-neighbors will be selected after thresholding at $\eta$, completing
the proof of the first claim of the theorem.

(f) Finally, in Lemma [g] we show that, for true neighbors j G V, if the true p*j > ^{e"^^ — 1) then 

9j > rj, and will thus be estimated to be in the true neighborhood. This completes the proof of the second 
claim of the theorem. 
Proposition 3.

$$
\nabla_j L(\theta^*) = -P[T_i > T_j \,;\; T_k \ne T_j \;\; \forall k \in V] \qquad (8)
$$



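Proof. Taking the derivative of $L(\cdot)$ with respect to $\theta_j$, we obtain

$$
\nabla_j L(\theta^*) = \mathbb{E}\left[ -\mathbb{1}_{\{T_j \le T_i - 2\}} + \mathbb{1}_{\{T_j = T_i - 1\}} \frac{1}{\exp\left(\sum_{k : T_k = T_i - 1} \theta^*_k\right) - 1} \right].
$$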



Let $\mathcal{F}_{T_j}$ be the sigma algebra with information up to the (random) time $T_j$. By iterated conditioning, we
obtain



-E 



E 



1 



1 



{Tj<Ti-2} 



{Tj=Ti-l} 



exp yz^k: Tk=Ti-l '^ki 

Since the event {Ti < Tj} is measurable in Txj, we have 

1 



(9) 



E 



1 



{Tj<Ti-2} 



L{T,=Ti-l} 



exp(Efc:r,=T,-i^L) -1 
On the other hand, if {Tj > Tj}, we have 



if Tj < Tj 



(10) 



E 



l{T,<Ti-2} 



4t,=t,-i} 



exp(Efe:T,=T,-l^:i) -1 

[Ti > T,- + 2 I Jr,] - K 



.exp(Efc:r,=T,-i^L) -1 
Considering the two terms above separately, we see that 



{T;=Ti-l} 



P[T,>T, + 2|^T,]=exp - 

which follows from the fact that the probability that (active) j failed to infect (susceptible) i is equal to 
the probability that all the nodes that were active at Tj failed to infect i. For the second term, we have 
1



{T,=Ti-l} 



exp(Efc:r,=T,-i^L) -1 
{Tj=T,-l}



-E 



1 



1 



{Tj=Ti-l} 



exp(Efe:T,=T,^L) -1 



I ~ X] ^fci I ^{3ifc6V S.t. Tk=Tj} 
k: Tk=Tj 
where $(\zeta_1)$ follows from the fact that $\{k : T_k = T_j\}$ is measurable in $\mathcal{F}_{T_j}$ and $(\zeta_2)$ follows from the fact that
$T_i = T_j + 1$ if and only if at least one of the parents of $i$ was active at $T_j$ and succeeded in infecting $i$.
Combining the above two equations, we obtain



E 



1 



1 



{T,<T,-2} 



{T,=T,-1} 



exp 



k: Tk=Ti-l ^si 



^{Tk^Tj V kev} if Ti > Tj 



Combining Q, ([To]) and ((TT]) 



'[T>Tj-Tky^Tkyk€V] 



(11) 



(12) 

□ 



An easy corollary of Proposition 3 is that if $j$ is a parent of $i$, then the gradient with respect to $\theta_j$ is
zero, since the event in (8) requires that none of the parents of $i$ be infected at the same time as $j$, which
fails for $k = j$. On the other hand, if $j$ is not a parent of $i$, the gradient is strictly negative since the
probability on the right hand side is strictly positive.

$$
\nabla_j L(\theta^*) = 0 \quad \text{if } j \in V \qquad (13)
$$

$$
\nabla_j L(\theta^*) < 0 \quad \text{if } j \notin V \qquad (14)
$$



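We now state our concentration results. For any $j$, let $\nabla_j \hat{L}(\theta)$ be the partial derivative of $\hat{L}(\theta)$ with respect
to $\theta_j$. For $j \in V$, let

$$
m_{1,j} := \left| \{ u : t^u_j = t^u_i - 1 \text{ and } t^u_k \ne t^u_i - 1 \;\; \forall k \in V \setminus j \} \right|
$$

be the number of cascades where $j$ is the sole infector of node $i$ and

$$
m_{2,j} := \left| \{ u : t^u_j \le t^u_i - 2 \} \right|
$$

be the number of cascades where $j$ is infected at least two time units before $i$.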
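The two counts are easy to compute from observed cascades. The sketch below assumes each cascade is a dictionary from node to infection time (with `math.inf` for nodes that were never infected) and reads the second definition literally, so a cascade in which $j$ is infected and $i$ never is counts towards $m_{2,j}$; the function name is hypothetical.

```python
import math


def count_m1_m2(cascades, i, j, parents):
    """Count cascades where j is the sole possible infector of i (m1) and
    cascades where j was infected at least two steps before i (m2).

    cascades : list of dicts mapping node -> infection time (math.inf if never)
    i, j     : child node and candidate parent
    parents  : set V of parents of i (j is expected to be in it)
    """
    m1 = m2 = 0
    for t in cascades:
        if t[j] == math.inf:
            continue  # j never infected: contributes to neither count
        if t[j] == t[i] - 1 and all(t[k] != t[i] - 1 for k in parents if k != j):
            m1 += 1
        elif t[j] <= t[i] - 2:
            m2 += 1
    return m1, m2
```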



Lemma 3. For m > — ^ , 

Pinit \ a 



) ^?l0g ( '^^ ^""^ ^^"^ 



(a) vMe*] 



< a for j £ V where a :- 



(b) \/jL{6*) < -b for j ^ V where b := 

(c) CiP*j < "T-iJ < for j G V where ^1 



a VPinit 
144d 



apinit 
16 



(d) 6 < rn2j < ^2 for j G V where 6 : = x log f and ^2 : 



l^logf andp*: 



l-exp(-0*: 



with probability greater than 1 — 5. 
Proof. For simplicity of notation we denote the number of samples as m



Clogf 

Pinit 



where C 



and D = Di. We will first prove (c). First, we note the following bounds for independent Bernoulli random 
variables Xi where is the mean of the sum of Xi. 



> {1 + k)h 



< 



exp(— k) \^ 
(1 - ) 



< 



1 + K 



(15) 
(16) 



So as to be able to use the above inequalities, we first establish bounds on the expected value of $m_{1,j}$.

where the bound uses the probability that $j$ is infected at time 0 and neither $i$ nor any of its other neighbors
are infected at time 0, and $j$ infects $i$ at time 1. Similarly, we have

E [mi j] < mF [Tj < oo] < y 



where we use Lemma [T] Now applying (15) to mij we obtain 



< (1 - h^^iP 



< 



exp 



a) 



< 



8D 



Similarly applying (16) to mij gives us 



"ii,j>(l + l)| 



< 



2 



«^ 5 
^8D 



This proves (c). The proof of (d) is similar. 



We will now prove (a). Fix any j G V. Let Uj = {u eU : T" < oo}. Since E [\Uj\] > pmitm = Clog ^, 
using (15), we obtain 



m < 



Clogf 



< 



16D 



(17) 



Similarly since E < Ei^m = ^ log ^, using (16), we obtain 



m > 



2Clog 



a 



< 



Define the random variable 



Zj - -i{Tj<n-2} + 



IQD 



L{T,=T,-1} 



(18) 



exp Efe:T,=T,-l^fc -1 
Note that we have the following absolute bound on $Z_j$



\ZA < 1 + 



1 



1 



exp(^*) - 1 p* 



where p*, = 1 — exp {—Oj) and also 



V: 



L{e*) = -y z^i = -y z^i 



(19) 



where is the realization of Zj on infection u. 



> a 



1 

m 



ueu 



> a 



> ma 



At this point we could apply the Azuma-Hoeffding inequality to bound the above probability. However, the
scaling factor in the exponent will be $m a^2$, which gives us an extra $p_{\text{init}}$. To avoid this, we bound the above
quantity as follows:



> ma 



< 



m > 



2Clogf , Clogf 



a 



or mi < 



+ 



E 



Cloi 



, D 



\Uj\ = s; 


E^" 


> ma 









2Clogf 



^ ^+ E E ip[^. = f^.]ip 

ClogD Uj:\U,\=s 
c — y— 



E^" 



> ma 



(20) 



where $U_j$ varies over all the subsets of $U$ and $(\zeta_1)$ follows from (17) and (18). Focusing on the last term,
we first note that the $Z_j^u$ are still independent random variables for $u \in U_j$. Since $\mathbb{E}[Z_j] = 0$ from (13), we can
apply the Azuma-Hoeffding inequality and, using (19), we obtain



E^" 



> ma 



Uj = Uj,\Uj\ = s 







< 2 exp 


— {ma) 





< 



16D 



(21) 



where (<^i) follows from the fact that s < 
for any j ^ V, 



2C1 



E[Z,] = v,Lie 



-. The proof of (b) is on the same lines after noting that 



< -Pinit (1 -Pinit) < TT- 



(22) 



where $(\zeta_1)$ follows from Proposition 3, $(\zeta_2)$ follows from the fact that the probability that $j$ is infected
before $i$ and none of the parents of $i$ are infected at the same time can be lower bounded by the case where
$j$ is infected at time 0 and neither $i$ nor any of its parents are infected at time 0. $(\zeta_3)$ follows from the
assumption that pinit < ^ and hence (1 — Pinit)'^'''^ > 5- Using (22) and Lemma [l| we obtain 



E[Zj\Tj <oc] 



'[T,-<oo] 



<= 



-a 



(23) 



Using (20) it suffices to show that 



Y,Z^> -mb 



bij = Uj,\Uj\ = s 



< 



16D 



for 



Clogf 



^ < s < 



2Clog^ 



follows. 



. An application of Azuma-Hoeffding inequality gives us the required bound as 



Y,Zf> -mb 



Uj = Uj,\Uj\ = s 



(?2) 
< 



^ - sE [Z,] > _ [Z, 



16 



u, = u„\u,\ 



sr^ , r -, Coi log X 
Y,ZJ-sE[Z,]> ^ 



Uj = Uj,\Uj\ = s 



I 



< exp 



\ 



V 



2C log 



< 



161) 



where (ft) follows by subtracting sE \Z,j\ from both sides of the inequality for which we are bounding 
the probability, (1^2) follows from the fact that s > ^ and (23) and (<J3) is an application of the 



Azuma-Hoeffding inequality using (19) and the fact that s < 



2Clog^ 



□ 



Lemma 4. When (a)-(d) in Lem,ma\^hold, maxjgv < |j 



Proof. Let $k = \arg\max_{j \in V} \hat{\theta}_j$. If $\hat{\theta}_k = 0$, we are done. So assume $\hat{\theta}_k > 0$. By the optimality of $\hat{\theta}$, we see
that

$$
\nabla_k \hat{L}(\hat{\theta}) = 0 \qquad (24)
$$

On the other hand, we have



-In 



m 



1 

< — 

m 

(f2) 1 

< — 

m 



-m2,k + 
-6 + 



1 



exp(6'fc) - 1 
1 



exp(6'fc) - 1 



m 



(25) 



where (?i) follows from the definition of mi^fc and the fact that on the infections corresponding to mi^^, we 
have 



J 7 I 



and (<J2) follows from Lemma pi Putting (24) and (25) together, we obtain the result. 



□ 



Lemma 5. When (a)-(d) in Lemma\^hold, Yl,j0>^j - x(|2^+^°S^) < V 



Proof. Since L{9) is concave, the subgradient condition at 9* gives us the following 



L{e)-L{e*) < [vL{e*),i 
(a) 



yv^L{9*),9vc) + {vvL{9*), 9v - 9^ 
? -6||^vHli+«ll^v-^vlli 

^vl |oo 



(26) 



where (ft) follows from the fact that 9yc = and (<J2) follows from the fact that > and Lemma [3| The 
optimality of 9 gives us 



L{9)-L{9*) > 
Finally we have the following bound on 1 1 1 1 oo ^ 



1 

a 



(27) 



(28) 



e* = -log(l-p*) <log 

Using ([26]), ([27|), ^ and Lemma [I] proves the first inequality, that Ej^v % ^ X (|i + The 
second inequality, that x (f^ + ^) < ^, is easy to see. □ 
Lemma 6. When (a)-(d) in Lemma hold, for every j ^ V we have that 9j > log ^1 + ~ V where

p* = l-eMO*). 

Proof. Since 9j > 0, by the optimality of we have 

VM^) < (29) 
On the other hand, we have the following bound on the gradient 

1 / 1 

- ~ I ~"^2j H p ^ 

exp l^j + ||6'vc||i 



m 



(^2 ) 1 

> - I -?2 + 7^ X P*A I (30) 

exp [Oj + ||6'vc||i - 1 



m 



where (<;"i) follows from the fact that on the infections corresponding to mi^k, we have 

fc t 

and {(^2) follows from Lemma [s] Combining (29), (30) and Lemma [S] gives us the result. □ 

Thus we see that if the true parameter pj- > ^{e"^^ — 1), then dj > rj and thus will be in the estimated 
neighborhood A/i. This completes the proof of Theorem [l] 

C Greedy algorithm 
C.1 Proof of Theorem 2

To simplify notation, we again denote $V_i$ by $V$, $S_i$ by $S$ and so on. From Lemma 1 we have that for every
node $j$,

$$
P[T_j < \infty] \le \frac{p_{\text{init}}}{\alpha}
$$

Since the graph is a tree, for every node $j$ there exists a unique (undirected) path between $i$ and $j$. All the
nodes on this path are said to be ancestors of $j$. Consider a node $j \in S \setminus V$. Let $k \in V$ be the ancestor of
$j$ on this path. Then we have that



[Tj / Tfc; Tfc = Ti-l]> Pinit (1 - PinitfP: 



2 „ 
''min 
If $l \in V$ but is not an ancestor of $j$ then

$$
P[T_j = T_l = T_i - 1] \le P[T_j < \infty]\, P[T_l < \infty] \le \left( \frac{p_{\text{init}}}{\alpha} \right)^2
$$

since $T_j$ and $T_l$ are independent conditioned on $T_j, T_l < T_i$. For any event $A$ that depends on the infection
times, let $N(A)$ denote the number of cascades in $U$ in which event $A$ has occurred. Using (15) and (16)
we have the following bounds on probabilities of error events:



N {Tk = Ti-l) < ( 1 - ^ ) mpminPimt (1 " Pinit) 



< 



'"PminPinit(l-Pinit) 
A\ 2 



N {Tk = Ti = Ti-l)> 

N {Tj = Ti = Ti-l)> 
1 



"T-PinitPmin 
M 

"T-PinitPmin 



< 



< 



\ »d J 



'"PinitPmii 



„ . ™PinitPmin 



( '"PinitPmin 
V 8d 



N (Tj ^Tk;Tk = Ti-l) < 1 - - mpinitPmin (1 - Pinit) 



<(i) 



2 N mPinit Pmin ( 1 "Pinit ) 



(31) 



where k,l £ V and j ^ V such that k is an ancestor of j. Substituting the value of m from the statement of 
Theorem [2] and recalling the assumption on pmit, we see that with probability greater than 1 — 5, we have 



D 



N{Tk = l 


i - 1) > 


N{Tk = Ti = l 


^, - 1) < 


N {Tj =Ti = l 


i - 1) < 


N{Tj^n;n = 'l 





Cd{l- Pinit) log -g 

2 

clogf 



8 

Cd{l -pinit)^logf 



(32) 
(33) 
(34) 
(35) 



Note that the assumption on pinit also yields an upper bound of jq on pinit- Now we will show that 
under the above conditions, Algorithm 2 recovers the original graph exactly. Suppose that in iteration $s$, the
neighborhood consists of $s - 1$ of the correct parents and there is at least one $k \in V$ not in the current neighborhood.



Let the current set of infections be U. Then from l\32n and (33), we see that 



Nu{Tk 



1) > 



Cd{l -pinit)logf 



clogf 
8 



D 



cdlogj 



(4(l-pinit)-l) >0 



So there is at least one node that will be added to the neighborhood. Now consider any $j \notin V$. If the
ancestor of $j$ that is a parent of $i$ has already been added to the neighborhood list, then from (34)

Nu (T, = : 



1) <d 



clogf 



< (4(l-pinit)-i; 

< Nu {Tk = Ti-l] 



cdlogf 
Suppose the ancestor of $j$ that is a parent of $i$ has not yet been added to the neighborhood of $i$. Without
loss of generality, let $k$ be the ancestor of $j$. Then,



Nu in = Ti-l)-Nu {Tj = T, - 1) 
= Nu{Tj^Tk;Tk = Ti-l) 
-Nu{Tj=Ti=Ti-l:l^ k, I G S) 



> 



Cd(l -pinit)^logf Clogf 



2d 



2(1 -p 



init I 



1 > 



Applying the union bound over all nodes in the superneighborhood, we can conclude that all nodes in
the superneighborhood satisfy (32), (33), (34) and (35) with probability greater than $1 - \delta$. This proves
Theorem 2.



D Lower Bounds 



D.1 Proof of Lemma 2

Recall from Lemma 1 that $P[T_i = t] \le (1 - \alpha)^{t-1} p_{\text{init}}$. The proof just involves using this to bound $H(T_i)$.
Since pinit < ^ j we have the following 



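\begin{align*}
H(T_i) &= -\sum_{t=1}^{n} P[T_i = t]\log P[T_i = t] - P[T_i = \infty]\log P[T_i = \infty]\\
&\le -\sum_{t=1}^{n} (1 - \alpha)^{t-1} p_{\mathrm{init}} \log\bigl((1 - \alpha)^{t-1} p_{\mathrm{init}}\bigr) - \Bigl(1 - \frac{p_{\mathrm{init}}}{\alpha}\Bigr)\log\Bigl(1 - \frac{p_{\mathrm{init}}}{\alpha}\Bigr)
\end{align*}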

2 



'1' ^L,^ + (i^V,„, 1 



1 — Q \ Pinit V / 1 — a 

1-^hogfl-^ 



a / \ a 
where (<Ji) follows from some algebraic manipulations. 
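As a numerical aid for the algebra in this proof, the sketch below evaluates the sum $-\sum_{t=1}^{n} q_t \log q_t$ with $q_t = (1-\alpha)^{t-1} p_{\mathrm{init}}$ and compares it with the closed form of the corresponding infinite series, $\frac{p_{\mathrm{init}}}{\alpha}\log\frac{1}{p_{\mathrm{init}}} + \frac{p_{\mathrm{init}}(1-\alpha)}{\alpha^2}\log\frac{1}{1-\alpha}$. The closed form is a standard geometric-series identity worked out for this illustration; it is not claimed to be the exact expression used in the lemma, and the function names are hypothetical.

```python
import math


def truncated_entropy_sum(alpha: float, p_init: float, n: int) -> float:
    """Return -sum_{t=1}^{n} q_t * log(q_t) with q_t = (1 - alpha)^(t-1) * p_init."""
    total = 0.0
    for t in range(1, n + 1):
        q = (1.0 - alpha) ** (t - 1) * p_init
        total -= q * math.log(q)
    return total


def infinite_sum_closed_form(alpha: float, p_init: float) -> float:
    """Closed form of the same sum taken over all t >= 1 (needs 0 < alpha < 1)."""
    first = (p_init / alpha) * math.log(1.0 / p_init)
    second = p_init * (1.0 - alpha) / alpha ** 2 * math.log(1.0 / (1.0 - alpha))
    return first + second


alpha, p_init = 0.3, 0.05  # illustrative values only
print(truncated_entropy_sum(alpha, p_init, 10), infinite_sum_closed_form(alpha, p_init))
```

Every term of the sum is nonnegative because $q_t \le 1$, so the truncated sum never exceeds the closed form and approaches it as $n$ grows.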






E Generalized Independent Cascade Model 



E.1 Proof of Prop. [2]

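Defining

$$x_i^0(\tau) = \begin{cases} 0 & \text{if } \tau < t_i \\ 1 & \text{if } \tau \ge t_i \end{cases}$$

and proceeding as in the proof of Proposition [1] we obtain

$$P_\theta[T = t] = P_\theta\bigl[X(0) = x^0(0)\bigr] \times \prod_{\tau=1}^{n} P_\theta\bigl[X(\tau) = x^0(\tau) \mid X(0:\tau-1) = x^0(0:\tau-1)\bigr]$$

where $X(0:\tau)$ denotes the (joint) values of the vectors $X(0), \dots, X(\tau)$. Now, $P_\theta[X(0) = x^0(0)] = p_{\mathrm{init}}^{s}(1 - p_{\mathrm{init}})^{n-s}$. Also,

$$P_\theta\bigl[X(\tau) = x^0(\tau) \mid X(0:\tau-1) = x^0(0:\tau-1)\bigr] = \prod_{i \in V} P_\theta\bigl[X_i(\tau) = x_i^0(\tau) \mid X(0:\tau-1) = x^0(0:\tau-1)\bigr]$$

because each node gets infected independently from each of its currently active neighbors. Thus we have
that

$$P_\theta[T = t] = p_{\mathrm{init}}^{s}(1 - p_{\mathrm{init}})^{n-s} \prod_{i \in V}\Bigl(\prod_{\tau=1}^{n} b_i(\tau)\Bigr) \tag{36}$$

where $b_i(\tau) = P_\theta[X_i(\tau) = x_i^0(\tau) \mid X(0:\tau-1) = x^0(0:\tau-1)]$. It is clear that for $\tau > t_i$, $b_i(\tau) = 1$. For
$\tau = t_i$, $b_i(\tau)$ is the probability that at least one of the parents $j$ of $i$ infected before $t_i$ infected node $i$ at
time $t_i$ given that $j$ did not infect $i$ before $t_i$. Thus,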

(..(*.) = 1 - n /^""-'t 

= 1 - n exp (-e*r*0 (37) 

j:tj<ti 

Finally, for each $\tau < t_i$, $b_i(\tau)$ is the probability that active nodes at time $\tau - 1$ failed to infect node $i$. The
set of all nodes that were active but failed to infect susceptible node $i$ is $\{j : t_j \le t_i - 2\}$. Each such node
$j$ failed to infect $i$ for $t_i - t_j - 1$ time slots. So we have

n^^(-)= n (i- E p'^^ 

T<U j-tj<ti-2 \ re[ti-tj-l] 

= n n «-p(-^.^o (38) 

j:tj<ti-2re[ti-tj-l] 



Putting (36), (37) and (38) together and taking log gives the result. 



Concavity again follows from the fact that $\log(1 - \exp(-x))$ is a concave function of $x$, and the fact
that if any function $f(x)$ is a concave function of $x$ then $f(\sum_i \theta_i)$ is jointly concave in $\theta$. □
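The two concavity facts used here can be spot-checked numerically. The sketch below is an illustration, not a proof: it tests the midpoint inequality $f\bigl(\tfrac{a+b}{2}\bigr) \ge \tfrac{1}{2}\bigl(f(a) + f(b)\bigr)$ on random pairs, first for $f(x) = \log(1 - \exp(-x))$ on scalars and then for $g(\theta) = f(\sum_i \theta_i)$ on random vectors. The function names, ranges and tolerance are assumptions made for this example.

```python
import math
import random
from typing import Callable, List


def f(x: float) -> float:
    """log(1 - exp(-x)) for x > 0, written with expm1 for numerical accuracy."""
    return math.log(-math.expm1(-x))


def g(theta: List[float]) -> float:
    """f applied to the sum of the coordinates of theta."""
    return f(sum(theta))


def midpoint_concave_scalar(func: Callable[[float], float], lo: float, hi: float,
                            trials: int = 1000, seed: int = 0) -> bool:
    """Return False as soon as one random pair violates the midpoint inequality."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.uniform(lo, hi), rng.uniform(lo, hi)
        if func(0.5 * (a + b)) < 0.5 * (func(a) + func(b)) - 1e-12:
            return False
    return True


def midpoint_concave_vector(func: Callable[[List[float]], float], dim: int, lo: float,
                            hi: float, trials: int = 1000, seed: int = 0) -> bool:
    """Same midpoint test for a function of a vector with entries in [lo, hi]."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = [rng.uniform(lo, hi) for _ in range(dim)]
        b = [rng.uniform(lo, hi) for _ in range(dim)]
        mid = [0.5 * (u + v) for u, v in zip(a, b)]
        if func(mid) < 0.5 * (func(a) + func(b)) - 1e-12:
            return False
    return True


print(midpoint_concave_scalar(f, 0.01, 10.0), midpoint_concave_vector(g, 3, 0.01, 3.0))
```

Joint concavity of $g$ is the usual fact that a concave function composed with a linear map (here, the coordinate sum) stays concave.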
F Markov Graphs and Causality



F.1 Proof of Theorem [4]

We will show that $P[T = t]$ can be written as a product of factors, where each factor depends only
on $t_{V_i}$ for some $i \in V$. Given any vector $t$, for every $i \in V$ define the infection vectors



Xi(T 



f if < T < ^ 

1 ift«<r<tF 



2 if T > t 



(2) 



We can see that there is a one to one correspondence between valid time vectors t and valid infection 
vectors x. Using the above transformation, we can calculate the probability of a given time vector t as 
follows: 

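\begin{align*}
P[T = t] &= P[X = x]\\
&= P[X(0) = x(0)] \times \prod_{s=1}^{\infty} P\bigl[X(s) = x(s) \mid x(0:s-1)\bigr]\\
&= \Bigl(\prod_{i \in V} P[X_i(0) = x_i(0)]\Bigr) \times \prod_{s=1}^{\infty} \prod_{i \in V} P\bigl[X_i(s) = x_i(s) \mid x_{V_i}(0:s-1)\bigr]\\
&= \Bigl(\prod_{i \in V} P[X_i(0) = x_i(0)]\Bigr) \times \prod_{i \in V} \prod_{s=1}^{\infty} P\bigl[X_i(s) = x_i(s) \mid x_{V_i}(0:s-1)\bigr]\\
&= \prod_{i \in V} f_i(t_{V_i})
\end{align*}

where $f_i(t_{V_i}) = P[X_i(0) = x_i(0)] \times \prod_{s=1}^{\infty} P[X_i(s) = x_i(s) \mid x_{V_i}(0:s-1)]$.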



□ 
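The factorization above has one factor per node $i$, and each factor couples the variables indexed by $V_i$. Assuming that $V_i$ consists of node $i$ together with its parents in the directed network graph (an assumption of this sketch), every such group forms a clique in the Markov graph. Building those cliques is the moralization operation: keep every parent-child edge, and connect every pair of parents that share a child. The function name `moralize` and the toy graph are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set


def moralize(parents: Dict[int, Set[int]]) -> Set[FrozenSet[int]]:
    """Return the undirected edge set of the moral graph.

    `parents[i]` is the set of parents of node i in the directed graph.
    """
    edges: Set[FrozenSet[int]] = set()
    for child, pa in parents.items():
        # Keep each directed edge parent -> child as an undirected edge.
        for p in pa:
            edges.add(frozenset((p, child)))
        # "Marry" every pair of parents of the same child.
        for a, b in combinations(sorted(pa), 2):
            edges.add(frozenset((a, b)))
    return edges


# Toy graph: 1 -> 3, 2 -> 3, 3 -> 4.
toy_parents = {3: {1, 2}, 4: {3}}
print(sorted(tuple(sorted(e)) for e in moralize(toy_parents)))
```

In the toy graph, nodes 1 and 2 are not adjacent in the network but become adjacent in the moral graph because they share child 3, so their infection times both enter that child's factor.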